Lessons learned from papers
Exam Prep SSD
Lecturer seems sound and quite good
-
Overview of general distributed scalable systems
-
Search engines (crawl, index and search)
-
Social Networking (response time, large amount of data)
-
Cloud Computing (availability and access to scalable resources)
-
CDNs (scalable web hosting, file distribution, media streaming)
-
Course topics: design, data centres and cloud computing, scalable storage and querying, scalable compute
-
These are the papers for storage and querying:
– "Bigtable: A Distributed Storage System for Structured Data", Seventh Symposium on Operating
Systems Design and Implementation (OSDI), Seattle, WA, November 2006
– "Dynamo: Amazon's Highly Available Key-Value Store", ACM Symposium on Operating Systems
Principles (SOSP), Stevenson, WA, October 2007
– "Spanner: Google's Globally-Distributed Database", Tenth Symposium on Operating Systems Design
and Implementation (OSDI), Hollywood, CA, October 2012
-
Papers for Scalable compute:
– "MapReduce: Simplified Data Processing on Large Clusters", Sixth Symposium on Operating
Systems Design and Implementation (OSDI), San Francisco, CA, December 2004
– "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing",
9th USENIX Symposium on Networked Systems Design and Implementation (NSDI), San Jose, CA, April 2012
-
Method for reading papers:
-
Skim the paper and get the gist
-
Come back for a deep read
-
Look at sample questions and find answers in the paper
-
Heinis will deal with scalable data
-
Prepare and work on research papers in lectures and seminars, as one of the coursework
assignments involves answering questions on a paper
-
The exam will also have a paper-based question
-
Resources:
– “Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable,
and Maintainable Systems”, Martin Kleppmann, O'Reilly Media, September 2014:
-
Focuses more on the data management side
-
Recommended
– “The Art of Scalability: Scalable Web Architecture, Processes and Organizations for the Modern
Enterprise”, Martin L. Abbott, Michael T. Fisher, Addison Wesley, 1st Edition, December 2009:
-
A little more high-level
-
A little outdated
-
Blogs:
– http://highscalability.com/
– http://www.allthingsdistributed.com/ (Werner Vogels’s blog)
– http://perspectives.mvdirona.com/ (James Hamilton’s blog)
-
Spanner is the hardest paper covered
Scalable Distributed Systems
-
Mainframe:
-
Single point of failure
-
Does not scale incrementally
-
Slow if used as a CDN (a single location, so distant users see high latency)
-
Data Centres:
-
Scale out (horizontally) by adding more machines
-
Types of Scalable Systems:
-
Online and user-facing (latency of < 100 ms)
-
Batch processing systems (> 1 hr)
-
Hadoop, Spark
-
Offline data processing
-
Nearline systems (< 1 sec)
-
Dynamic content presented to users
-
CDN-ed content
-
Prediction, recommendations, etc.
-
Design principles:
• Stateless services
• Caching
• Partition/aggregation pattern
• Weaker consistency
• Efficient failure recovery
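Sketch of the partition/aggregation pattern from the list above, to make it concrete. This is my own minimal illustration, not from the slides: the names (partition, search_shard, partition_aggregate), the modulo partitioning and the top-k query are all assumptions, and threads stand in for what would be separate machines.

```python
from concurrent.futures import ThreadPoolExecutor
import heapq

def partition(docs, num_shards):
    """Assign each (doc_id, score) pair to a shard by doc_id modulo num_shards."""
    shards = [[] for _ in range(num_shards)]
    for doc_id, score in docs:
        shards[doc_id % num_shards].append((doc_id, score))
    return shards

def search_shard(shard, k):
    """A shard keeps no per-request state: it just returns its local top-k."""
    return heapq.nlargest(k, shard, key=lambda pair: pair[1])

def partition_aggregate(shards, k):
    """Fan the query out to every shard in parallel, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(lambda s: search_shard(s, k), shards))
    merged = [pair for partial in partials for pair in partial]
    return heapq.nlargest(k, merged, key=lambda pair: pair[1])

# Hypothetical data: 1000 documents with distinct scores.
docs = [(i, (i * 856) % 1009) for i in range(1000)]
shards = partition(docs, num_shards=4)
top3 = partition_aggregate(shards, k=3)
print(top3)
```

Each shard only returns k candidates, so the aggregator merges at most k * num_shards items regardless of how much data the shards hold. A real system would also need timeouts and a policy for shards that fail or answer late, which ties into the weaker consistency and failure recovery points.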
Missed
BigTable discussion
BigTable
Dynamo discussion
Dynamo
Spanner discussion
Spanner
MapReduce discussion
MapReduce
Spark discussion
Spark
Oh the pain. The pain. It always rains. In my soul
ZooKeeper Notes
This could be a really good exam question (C) Thomas Heinis
How can we make a data structure efficient for read/write-heavy main-memory workloads?
Answers are on the Cache-Sensitive Search Tree slides
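Rough sketch to remind myself of the idea, assuming the slides follow the usual cache-sensitive search tree approach: keep the keys in a sorted array and put a pointer-free directory on top, with each node sized to a cache line and children located by arithmetic instead of stored pointers. This is a simplified multi-level layout of my own, not the exact structure from the slides, and NODE_KEYS, build_levels and search are hypothetical names. It is a static, read-oriented structure and does not show how to handle writes.

```python
from bisect import bisect_left

NODE_KEYS = 16  # assumption: 16 four-byte keys fill one 64-byte cache line

def build_levels(sorted_keys, node_keys=NODE_KEYS):
    """Build directory levels bottom-up; each entry is the largest key of one node below."""
    levels = [list(sorted_keys)]
    while len(levels[-1]) > node_keys:
        below = levels[-1]
        levels.append([below[min(i + node_keys, len(below)) - 1]
                       for i in range(0, len(below), node_keys)])
    return levels  # levels[0] is the data array, levels[-1] is the root node

def search(levels, key, node_keys=NODE_KEYS):
    """Return the index of key in the data array, or -1 if it is absent."""
    node = 0  # number of the node to inspect at the current level
    pos = 0
    for level in reversed(levels):
        start = node * node_keys
        end = min(start + node_keys, len(level))
        pos = bisect_left(level, key, start, end)  # search within a single node
        if pos == end:
            return -1  # key is larger than every key in this subtree
        node = pos  # entry number at this level is the child node number below
    return pos if levels[0][pos] == key else -1

keys = list(range(0, 3000, 3))
levels = build_levels(keys)
print(len(levels), search(levels, 2997), search(levels, 1))
```

In Python the lists hold object references, so this only shows the layout and the index arithmetic. The cache benefit would come from a language like C where each level is a contiguous array of fixed-width keys.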
ACA is king
READ the DBMS book
Chapter 7: Storage
Chapter 8: Indexes
Chapter 18: Transactions